feat: compute per-output split metadata in merge engine#6359
Open
g-talbot wants to merge 2 commits intogtt/parquet-merge-policy-configfrom
Open
feat: compute per-output split metadata in merge engine#6359g-talbot wants to merge 2 commits intogtt/parquet-merge-policy-configfrom
g-talbot wants to merge 2 commits intogtt/parquet-merge-policy-configfrom
Conversation
fc6f90a to
720560d
Compare
The merge engine now extracts metric_names, time_range, and low_cardinality_tags from each output file's actual rows during the merge write pass. Previously, MergeOutputFile only contained physical metadata (num_rows, size_bytes, row_keys, zonemaps). The downstream metadata_aggregation function inferred logical metadata by unioning all input splits — which is incorrect when num_outputs > 1, since each output contains only a subset of the globally sorted rows. Now each MergeOutputFile carries: - metric_names: distinct metrics in this output's rows - time_range: min/max timestamp_secs from this output's rows - low_cardinality_tags: service names from this output's rows Reuses existing extract_metric_names, extract_service_names, and extract_time_range from split_writer (made pub(crate)). Includes test that verifies per-output metadata is computed from actual rows when merging 2 inputs into 2 outputs with different metric names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
720560d to
3e61af4
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Stacked on #6351 (Phase 2). Should be merged before #6352 (Phase 3a).
The merge engine now extracts
metric_names,time_range, andlow_cardinality_tags(service) from each output file's actual rows during the merge write pass.Problem:
MergeOutputFilepreviously only had physical metadata (num_rows, size_bytes, row_keys, zonemaps). The downstreammetadata_aggregationfunction inferred logical metadata by unioning all input splits. This is incorrect whennum_outputs > 1— each output contains a different subset of the globally sorted rows and should have metadata reflecting only its own data.Fix: Each
MergeOutputFilenow carries per-output logical metadata extracted from thesorted_batchbefore writing. Reusesextract_metric_names,extract_service_names,extract_time_rangefromsplit_writer(madepub(crate)).Test plan
test_merge_per_output_metadata_from_actual_rows— verifies 2-output merge has correct per-output metric_names and time_rangetest_merge_multiple_outputswith per-output metadata assertionscargo clippyclean🤖 Generated with Claude Code